# libraries
library(dplyr)
library(ggplot2)
library(forcats)
# source functions
names_functions = list.files(here::here("functions"))
for (f in names_functions)
source(here::here("functions", f))
rm(f, names_functions)Construct priors
This notebook constructs the priors for the model estimation. See model_description.htmlfor details. Some of the priors are independent of the specific scenario being estimated, e.g. the innovations to the reverse random-walk \boldsymbol{W}; others depend on the scenario and the resulting fundamental forecast, e.g. the prior mean m_{\mu_T} and variance V_{\mu_T} on the (transformed) latent voting intentions on election day.
To-dos
regularization of \boldsymbol{\hat{W}}-> James-Stein type estimator from pkgcorpcoruse historic polls to inform \boldsymbol{\hat{W}}-> discussed but not implemented- compare correlation of historical election results with correlation of ethnicity across states
- simulate trajectories of \mu_{1:T} for different values of \kappa given \boldsymbol{\hat{W}} -> prior predictive distribution
- simulate house effects by drawing from prior and using a sequence of polls from previous election as starting point -> prior predictive distribution
- differences in V_{\mu_T} across scenarios
add forecasts from fundamental model
Preliminaries
Initialize empty list to store priors for different scenarios
priors <- list() Load number of parties running in each state -> needed to set the prior on \mu_T
n_parties_by_state <- load_n_parties_by_geography("state")Load election results -> needed to estimate \boldsymbol{W}
election_results <- load_election_vote_shares()
n_elections <- length(unique(election_results$year))Calculate the dim names of \mu_T, i.e. the state and party that each entry refers to. These are also the dimnames of \boldsymbol{W} and V_{\mu_T}. These serve as a cross-check that the priors are ordered in the same way as in the data preparation and estimation!
names_mmu_T <- c()
parties <- load_parties()
for (s in seq(1, length(n_parties_by_state))) {
tmp <- paste0(names(n_parties_by_state)[s], "_")
for (p in seq(1, n_parties_by_state[s] - 1)) {
names_mmu_T <- c(names_mmu_T, paste0(tmp, parties[p]))
}
}Scenario-independent priors
covariance matrix \boldsymbol{W}
To specify the “prior” on the innovations to the reverse random-walk, it is useful to decompose the covariance as
\boldsymbol{W} = \kappa \times \boldsymbol{\hat{W}}
where \boldsymbol{\hat{W}} is a correlation matrix and \kappa a scale factor. Intuitively, \boldsymbol{\hat{W}} accounts for the comovement of the changes in latent voting intentions across parties and states while \kappa governs how large changes in voting intentions are!
To estimate \boldsymbol{\hat{W}} I consider the correlation of historical election outcomes across parties and states. These correlations are far from a perfect measure of what I am after because there is no variation in vote shares/intentions within an election campaign. But using the historic results has the advantage of being straight-forward to calculate!
An additional source of information could be the correlation across parties and states of polls from previous elections but it’s less obvious to me how these would need to be handled. Aggregated over time in some way? Does it make sense to compare a poll in Amperville on May 5th in 1984 with one in Circuiton on June 7th in 2023? What does that say about the evolution of the underlying voting intentions? Polls also don’t just track the voting intentions but also noise like house effects. Although these might be constant from day to day so actual movement in polls would be down to changes in voting intentions. On the whole, I am not sure how useful these data are or at least how I could extract useful information from them for my goal.
Lastly, demographic information like ethnicity could be useful in specifying the correlation of changes in voting intentions between states. But similarities or differences in ethnicities across states don’t necessarily reveal anything about correlations between parties.
When calculating the correlation of latent vote share across states and parties, I use all the available historical election results since
[…] although parties’ vote shares in each province oscillate from year to year—possibly for quantifiable reasons like the strength of the economy, and possibly for unquantifiable ones like the comparative appeal of their candidates or of their proposed policies—the baseline popularity of each party in each province is constant all the way from 1984 to 2024, as are the characteristics of each pollster. There is no equivalent in Dataland of, say, West Virginia in the United States shifting over time from a reliably Democratic-voting state to a reliably Republican-voting one, because there are no long-run time trends. As a result, when predicting elections in 2024, you should treat historical results and polls from 1984 as being just as informative as those from 2023 are. (background material on Dataland, my emphasis)
Plot historical election results
election_results %>%
tidyr::pivot_longer(
cols = c("cc", "dgm", "pdal", "ssp"),
names_to = "party",
values_to = "vote_share"
) %>%
filter(vote_share != 0) %>%
ggplot(aes(
x = year,
y = vote_share,
color = party
)
) +
geom_point() +
facet_wrap(~state) +
ggsci::scale_color_jco()# calculate correlation acroos elections
mat_res <- calc_cor_election_results(election_results)
corr_mat <- cor(mat_res)In addition, I consider a James-Stein-type shrinkage estimator of the correlation matrix (the degree of shrinkage is estimated from the data, see the documentation of cor.shrink)
corr_mat_shrink <- corpcor::cor.shrink(mat_res)Estimating optimal shrinkage intensity lambda (correlation matrix): 0.2019
# manually convert to matrix class first
class(corr_mat_shrink) <- "matrix"While the original correlation matrix is positive definite…
corpcor::is.positive.definite(corr_mat)[1] TRUE
corpcor::rank.condition(corr_mat)$condition[1] 51617.74
… the shrunk correlation matrix is much better conditioned (and also positive definitve)
corpcor::is.positive.definite(corr_mat_shrink)[1] TRUE
corpcor::rank.condition(corr_mat_shrink)$condition[1] 49.02109
Plot correlation of ethnicity across states
dat <- read.csv(
here::here("data", "dataland_demographics.csv"),
stringsAsFactors = FALSE
)
A <- as.matrix(dat[, grepl("_share", names(dat))])
rownames(A) <- dat$province
states_regions <- load_dataland_states_regions()
corr_mat <- cor(t(A))
corr_mat[lower.tri(corr_mat)] <- NA
as.data.frame(corr_mat) %>%
tibble::rownames_to_column(var = "state1") %>%
tidyr::pivot_longer(
cols = -state1,
names_to = "state2",
values_to = "value"
) %>%
filter(!is.na(value)) %>%
mutate(value = round(value, 2)) -> df_cor
# plot
ggplot(df_cor, aes(state1, state2)) +
geom_raster(aes(fill=value)) +
scale_fill_gradient2(
low = "blue",
high = "red",
mid = "white",
midpoint = 0,
limit = c(-1,1)
) +
labs(
x = "",
y = "",
title = "Ethnicity: correlation matrix"
) +
theme(
axis.text.x = element_text(size = 7,angle = 90),
axis.text.y = element_text(size = 7),
legend.position = "top"
) -> p
plotly::ggplotly(p)Function to plot submatrix
get_submatrix <- function(
mat,
str_row,
str_col
) {
return(
mat[grepl(
str_row,
rownames(mat)
),
grepl(str_col,
colnames(mat)
)
]
)
}get_submatrix(
corr_mat_shrink,
"Amperville",
"Voltagea"
) Voltagea_CC Voltagea_DGM
Amperville_CC 0.7731378 -0.3482786
Amperville_DGM -0.4169704 0.7726008
get_submatrix(
corr_mat_shrink,
"Amperville",
"Quantumridge"
) Quantumridge_CC Quantumridge_DGM
Amperville_CC 0.6421345 -0.2282983
Amperville_DGM -0.3092434 0.6422348
get_submatrix(
corr_mat_shrink,
"Amperville",
"Neuronia"
) Neuronia_CC Neuronia_DGM Neuronia_PDAL
Amperville_CC -0.04022698 -0.2775316 -0.2668184
Amperville_DGM 0.13815788 0.4155999 0.2830464
get_submatrix(
corr_mat_shrink,
"Infinitron Peninsula",
"Voltagea"
) Voltagea_CC Voltagea_DGM
Infinitron Peninsula_CC 0.6121097 -0.2004642
Infinitron Peninsula_DGM -0.2276086 0.6255029
get_submatrix(
corr_mat_shrink,
"Infinitron Peninsula",
"Quantumridge"
) Quantumridge_CC Quantumridge_DGM
Infinitron Peninsula_CC 0.7342418 -0.2873248
Infinitron Peninsula_DGM -0.3648000 0.7411214
Correlation matrix \boldsymbol{\hat{W}}
Use shrunk correlation matrix!
W_hat <- corr_mat_shrinkScale factor \kappa
kappa <- 0.01Calculate \boldsymbol{W} and store in list
W <- kappa * W_hat
priors[["A"]][["W"]] <-
priors[["B"]][["W"]] <-
priors[["C"]][["W"]] <-
priors[["D"]][["W"]] <-
priors[["E"]][["W"]] <- Wvariance of house effects \sigma^2_{\delta}
sig_ddelta <- sqrt(0.01) # specify in terms of standard deviation for Stan!Store in list
priors[["A"]][["sig_ddelta"]] <-
priors[["B"]][["sig_ddelta"]] <-
priors[["C"]][["sig_ddelta"]] <-
priors[["D"]][["sig_ddelta"]] <-
priors[["E"]][["sig_ddelta"]] <- sig_ddeltaScenario-dependent priors
scenarios <- load_scenarios()m_{\mu_T}
For the time being, the prior mean of \mu_T is based on the historical average of vote shares
Load fundamental forecast
df_fcast <- readRDS(here::here("fundamental_forecast", "fundamental_forecast.rds"))Loop over scenarios, convert to log ratio scale and store in list
for (scen in scenarios) {
df_fcast %>%
inner_join(
load_dataland_states_regions(),
by = join_by(province == state)) %>%
filter(scenario == scen) %>%
select(-scenario) %>%
mutate(party = toupper(party)) %>%
group_by(province) %>%
mutate(vote_share_lr = additive_log_ratio(vote_share)) %>%
filter(vote_share_lr != 0) %>%
arrange(region, province) %>%
tidyr::unite(
name,
province,
party,
sep = "_") -> df_m_mmu_T
# check calc
stopifnot(all(df_m_mmu_T$name == names_mmu_T))
# store in list
priors[[scen]][["m_mmu_T"]] <- df_m_mmu_T$vote_share_lr
names(priors[[scen]][["m_mmu_T"]]) <- names_mmu_T
rm(df_m_mmu_T)
}Adding missing grouping variables: `scenario`
Adding missing grouping variables: `scenario`
Adding missing grouping variables: `scenario`
Adding missing grouping variables: `scenario`
Adding missing grouping variables: `scenario`
V_{\mu_T}
The prior variance can vary across scenarios to reflect how far advanced the campaign is.
In addition, it can also be set to a smaller value - placing greater weight on the fundamental forecast - in those states where within a given scenario fewer or no polls are available
For the time being, however, I set a relatively uninformative prior that is the same in all scenarios and states
V_mmu_T <- diag(rep(1, sum(n_parties_by_state-1)))
dimnames(V_mmu_T) <- list(names_mmu_T, names_mmu_T)Store in list
priors[["A"]][["V_mmu_T"]] <-
priors[["B"]][["V_mmu_T"]] <-
priors[["C"]][["V_mmu_T"]] <-
priors[["D"]][["V_mmu_T"]] <-
priors[["E"]][["V_mmu_T"]] <- V_mmu_TExport
saveRDS(priors, file = here::here("priors", "priors.Rds"))
rm(priors)Prior predictive distribution
priors <- readRDS(file = here::here("priors", "priors.Rds"))